CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

Abstract Motivation K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
K-mers, contiguous strings of DNA or RNA of length k, are frequently utilized in computational biology for a variety of purposes including in genome assembly (Koren et al., 2017;Liu et al., 2012;Luo et al., 2012), metagenomic sequences classification (Dilthey et al., 2019;Ondov et al., 2016;Wood and Salzberg, 2014), motif discovery (Fletez-Brant et al., 2013;Zhang et al., 2017a) and largescale genomic comparisons (Ondov et al., 2019;Solomon and Kingsford, 2017). A number of hashing-based techniques such as MinHash (Broder, 1997), Bloom filter (Bloom, 1970) and Count-Min Sketch (Cormode and Muthukrishnan, 2005) have been developed or adopted for efficient computation of k-mer-based similarity methods. In each such application, the first step is to collect a set of k-mers from input sequences. Importantly, it has been found that algorithm performance depends critically on the choice of size k. Indeed, various heuristic and empirical strategies have been introduced to find optimal k-mer sizes that increase performance in certain application areas (Chikhi and Medvedev, 2014;Schulz et al., 2014;Zhang et al., 2017b). However, whenever a new k size is selected, each computational technique requires reconstructing the k-mer-based data structure and rerunning the analytical pipeline, leading to computational inefficiencies.
In particular, hashing-based k-mer methods that compute measures of similarity of genomic and metagenomic data (such as the Jaccard and containment indices) have been demonstrated to extract valuable insight from metagenomic data (Besta et al., 2020;Ondov et al., 2016;Pierce et al., 2019). Multiple hashing-based techniques involving the estimation of Jaccard index and/or other k-mer derivatives have been developed. For example, Mash (Ondov et al., 2019), Sourmash (Pierce et al., 2019) and Skmer (Sarmashghi et al., 2019). Several efforts have been made to improve the efficiency for single k value hashing method, such as b-bit wise MinHash (Li and Kö nig, 2011) and Dashing with HyperLogLog sketches (Baker and Langmead, 2019). In these cases too, however, each time a new k size is utilized, the entire computational processes needs to be repeated.

Motivation
In situations where reference databases can exceed several hundred gigabytes, such as in metagenomics, indexing or sketching the database multiple times for different k-mer sizes is computationally expensive and may become an analysis bottleneck. Nevertheless, adjusting k-mer sizes plays a valuable role such as in metagenomics, where selection of k-mer size can impact performance of i28 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Bioinformatics, 38, 2022, i28-i35 https://doi.org/10.1093/bioinformatics/btac237 ISCB/ISMB 2022 downstream methods (such as sensitivity or specificity of taxonomic profiling algorithms). To circumvent this, some subfields of application have proposed heuristic approaches to estimating optimal k-mer size. For example, KmerGenie (Chikhi and Medvedev, 2014) is a heuristic method to determine the optimal k value for genome assembly. Additionally, it has been recognized that it is non-trivial to find the 'right' k-mer size in practice (Marchet et al., 2020;Pierce et al., 2019;Song et al., 2014).
Besides, the choice of k is usually purpose-specific based on the compromise between sensitivity and specificity. In practice, Mash set defaults k to 21 for the purpose of controlling the probability q of observing a random k-mer under some cutoff (e.g. k ¼ 19 is corresponding to q ¼ 0.01 with genomes of $3 GB size). On the other hand, utilizing a low k value grants the tool the power to deal with variances in the real data (Ondov et al., 2019). However, the cutoffs and tolerances are dynamic: Kraken, a k-mer-based metagenomic classification tool, set the default k value to 31 for better discriminatory power on lower taxonomic ranks (Wood and Salzberg, 2014). In our previous work, k ¼ 61 was found to be empirically optimal to reflect metagenomic composition via alignment (LaPierre et al., 2020). In many computational works, the default or optimal k values are obtained through benchmarking analysis and these tools usually relax the freedom of choosing arbitrary k values by the user to fit various aims based on the datasets.
While our research is mainly about metagenomics, flexibility in different taxonomic levels is crucial. Being able to adjust k values grants us one more dimension of freedom than the similarity cutoffs: a small k value (e.g. 21) is feasible to search against the whole metagenomic database for similar matches while a large k value can discriminate the sub-structures within genus and even species. While we have observed that the rate of decrease of the Jaccard index as a function of k recapitulates the evolutionary relatedness (Koslicki and Falush, 2016), a tool that can efficiently handle the computational challenges of multiple k values can be helpful to further the exploration of metagenomic studies.

Outcome
To address this problem, we combine a modified MinHash technique (ArgMinHash) and a data structure called a k-mer ternary search tree (KTST), which allows Jaccard and containment indices to be computed at multiple k-mer sizes efficiently and simultaneously. In Figure 1, we provide a high-level description of how we accomplish this: first, we randomly subsample k-mers based on a large k size k max (Fig. 1b) to build k-mer sketches. The sketch elements (i.e. k max -mers) are then inserted into a KTST (Fig. 1c), which allows for efficient prefix lookups. A prefix lookup in the KTST effectively truncates a k max -mer resulting in a smaller k-mer (Fig. 1d). This allows us to efficiently compute k-mer sketches for every k k max (Fig. 1e). This truncation step avoids the needs to reprocessing the whole reference database for sketches with a different k size, making CMash much more efficient when handling large reference database. Combined with the containment MinHash approach (Koslicki and Zabeti, 2019), we can estimate the Jaccard and containment indices for all k < k max without requiring explicit re-computation of each single k value. More details about CMash workflow and the data processing can be found in Supplementary Figure S1.
This truncation-based method turns out to be a biased estimator of the k-mer-based similarities. However, in our empirical analysis, we find that the CMash estimate of the Jaccard and containment index does not deviate significantly from the ground truth, indicating that this approach can give fast and reliable results with minimal bias. Compared to our previous MinHash-based approximation to the containment index (Koslicki and Zabeti, 2019), we find that the CMash estimate for ten k values is approximately ten times faster and requires only one-fifth of space to store the reference database.
Importantly, this approach can be generalized to more than similarity computation: many sketching, k-mer or shingling-based approach may adopt our method to avoid the need to re-compute kmer sets when changing the k size. As such, this probabilistic data analysis approach should find application outside of metagenomics (the application we focus on) to genomics more broadly, as well as other applications that utilize a k-mer or shingling approach.
In summary, we demonstrate how this CMash technique can be applied to several widely utilized tools (e.g. Mash Screen (Ondov et al., 2016), Sourmash (Pierce et al., 2019)) and will help to speed up k-mer-based computation when multiple k sizes are needed. A proof of concept implementation of the algorithm and data structure is freely available at https://github.com/dkoslicki/CMash.

Materials and methods
Here, we describe our algorithmic approach, but first we recall a few necessary definitions.

Jaccard and containment index
In computational biology, k-mers are consecutive substrings of length k of nucleotides A ¼ fA; C; G; Tg. The similarity between genomic data can be measured by the similarity of their respective kmer sets: the collection of all distinct k-mers appearing as contiguous substrings in the data. If A is a collection of strings on the alphabet A, then A k is defined to be the set of all unique k-mers in A. In this entire section, most of the definitions given apply to arbitrary sets, but with the genomic application area in mind, we often suppress the superscript and write A instead of A k for simplicity, with the implicit understanding that a set of k-mers depends on the k value chosen.
The Jaccard index (JI) measures the similarity of two sets by comparing the relative size of the intersection over the union (Broder, 1997). For two non-empty finite sets A and B, the Jaccard index is defined as JðA; BÞ ¼ jA\Bj jA[Bj . Hence, 0 JðA; BÞ 1 with larger values indicating more overlap. Similarly, the containment index (CI) of A in B (with A non-empty) measures the relative size of the intersection over the size of A: CðA; BÞ ¼ jA\Bj jAj . So 0 CðA; BÞ 1 with larger values indicating that more content of A resides in B. If the cardinality of both A and B are known, the Jaccard index and containment index are interchangeable: JðA; BÞ ¼ jAj Á CðA; BÞ jAj þ jBj À jAj Á CðA; BÞ : (1) When applied to sets of k-mers, we call out the dependence on k with the following definitions:

Classic MinHash algorithm for the Jaccard index
For very large sets A and B (such as k-mer sets for moderate to large k derived from genomic data), computing the Jaccard index directly can be computationally taxing. To circumvent this, Broder proposed MinHash to efficiently estimate the Jaccard index for large sets (Broder, 1997). MinHash uses a random sampling process: first, we fix a constant m 2 Z þ (m is usually called sketch size) and select a family of m min-wise independent hash functions H ¼ fh 1 ; h 2 ; . . . ; h m g whose domains contain jA [ Bj. Then, we define the MinHash sketch of a set A as the element (ties can be solved by lexicographic order) in A that cause some h i to have the minimum value on A. More formally, define h min i ðAÞ ¼ argmin a2A h i ðaÞ: Next, define m random variables X ¼ fX 1 ; X 2 ; . . . ; X m g, such that: The probability of a MinHash collision (i.e. h min i ðAÞ ¼ h min i ðBÞ) is an unbiased estimate of J(A, B): In practice, a 'bottom sketch' strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm. Instead of using m hash functions, all k-mers from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch S b ðAÞ of size m. The probability that sketch S b ðAÞ; S b ðBÞ share a hash value represents the probability of random sampling a shared (c) For some large k value k max , one and only one k-mer sketch of the reference data will be constructed and inserted into a KTST. (d) All k-mer sketches corresponding to a smaller k value will be obtained by a prefix lookup in the KTST. (e) For k < kmax, k-mers from the query data are streamed through the KTST resulting in (f) reliable estimates for a range of k-mer sizes with greater computational efficiency element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between S b ðAÞ and S b ðBÞ.
This efficient approach has found use in, e.g. metagenomics where hundreds of thousands of microbial genomes may under consideration. For example, both Sourmash (Pierce et al., 2019) and Mash Screen (Ondov et al., 2016) maintain hash value sketches of all input genomes for comparison. However, k-mer information is lost during if one only considers hash values, instead of elements leading to minimal hash values. Herein, we will show how we can benefit from using a k-mer sketch instead of a hash value sketch in similarity analysis.
We now define a bottom m k-mer sketch. Let A k be the set of all k-mers derived from a set of sequences/string A and define MIN m ðA k Þ as the set of the m elements corresponding to the m smallest hash values in set fhðaÞ : a 2 A k g. Namely, for m a given sketch size and k the k-mer size, the k-mer MinHash sketch of A is defined to be We may suppress m and k for notational simplicity.

Containment MinHash
Though the MinHash approach gives an unbiased estimation of the Jaccard index, its performance may degrade considerably when A and B are of significantly different sizes (Koslicki and Zabeti, 2019). More robust estimation of J(A, B) can be obtained through C(A, B), the containment index of A in B. This strategy is called 'containment MinHash' (Koslicki and Zabeti, 2019). We detail this procedure now. Given a fixed k-mer size and two nonempty distinct sets of strings A and B on the alphabet A such that jA k j jB k j, we first compute S m k ðAÞ, the bottom sketch of the smaller set. Next, we can stream all elements in the set B over S m k ðAÞ to estimate C(A, B). Since S m k ðAÞ is a uniform random sample from set A, the proportion of elements in S m k ðAÞ that are found in set B is an unbiased estimator of the containment index. Namely, To be noted, this streaming method is an efficient algorithm for the estimation of the containment index in metagenomic settings and is utilized by Mash Screen (Ondov et al., 2016), Metalign (LaPierre et al., 2020), etc. Finally, we can take advantage of Equation (1) to compute J k ðA; BÞ based on the containment index and the cardinalities of set A and B (which can be quickly approximated by fast cardinality estimation such as Hyperloglog (Flajolet et al., 2007)). In CMash, we use this contaiment MinHash approach for JI estimation considering its metagenomic analysis setting.

CMash
The approach we call CMash consists of two main components: first is the aforementioned k-mer MinHash sketches, and second a traditional ternary search tree applied to sets of k-mers.

ArgMinHash
We now detail the first half of the CMash approach: a data structure we call 'ArgMinHash' that utilizes k-mer MinHash sketches. In particular, there is an important but subtle difference between the aforementioned MinHash bottom m sketches and the k-mer MinHash sketches utilized by CMash. In particular, the definition in Equation (5) shows that the bottom m sketch utilized by the containment MinHash (or even MinHash itself) are comprised of the smallest m hash values. In contrast, the sketches utilized by CMash are comprised of elements of a set that hash to small values. This difference is key to allowing a truncation-based approach. Indeed, if we used a sketch comprised of hash values of k-mers, truncating these hash values would have no relationship at all to the hash values obtained from truncated k-mers.
More formally, let A k be the set of all k-mers derived from a set of sequences/strings A. A hash function h with domain containing A k induces an order on A k . If collisions are present, we can impose an additional ordering (say, lexicographic) to break ties. Then we define ARGMIN h m ðA k Þ as the set of m smallest, according to the ordering imposed by h, elements of the set A k . Then for m a given sketch size and k the k-mer size, the 'argmin-bottom' sketch of A is defined to be where AMH is an abbreviation for 'ArgMinHash'.

K-mer ternary search tree
Given a set of collections of sequences D ¼ fA 1 ; . . . ; A N g, here thought of as genomes of N different (micro)organisms, we populate a single ternary search tree KTST with the sketches AMH m kmax ðA i Þ; i ¼ 1 . . . N for a fixed sketch size m and a fixed (large) k max . Recall that a ternary search tree is a data structure that allows fast (average Oðlog nÞ) lookup of prefixes so that every root to leaf path (equivalently, node) represents a k-mer. Furthermore, nodes in KTST can be labeled with which elements of D contain the prefix defined by that node. We further associated a sequence of counters c 1 i ; . . . ; c kmax i to each A i in D. We further accelerate prefix queries by populating a bloom filter with every k-mer defined by nodes in the KTST.
Note that by inserting the sketches AMH m kmax ðA i Þ into the KTST, we have effectively computed proxies to AMH m k ðA i Þ for each k k max . Indeed, we obtain new sketches for a smaller k-mer size k by truncating the KTST to a depth of k (Fig. 1d).
We can then approximate the containment index of each reference A i in some other set of sequences B (thinking of B as a large genomic dataset) in the following way: the k-mers of B for each k ¼ 1; 2; . . . k max are streamed through the KTST similar to the aforementioned Mash Screen (see Fig. 1e). When a k-mer is found to correspond to a node in the KTST, each of the counters c k i is incremented for each A i associated with that node/k-mer. After the streaming is complete, we will have that In doing so, in a single stream over the input data B, we are able to approximate C k ðA i ; BÞ for each A i and each k ¼ 1; . . . ; k max . If during the construction of the sketches AMH m k ðA i Þ, we also store the cardinality of A k i , we can obtain the Jaccard indices J k ðA i ; BÞ as well.
The ability to estimate Jaccard or contaiment indices for multiple k-mer sizes (up to some maximum k max -mer size) motivates the multi-resolution nature of CMash. Indices can be calculated for both small k-mer sizes (low resolution), and large k-mer sizes (high resolution) utilizing a single data structure.

Biased nature of the estimate
There is no reason to think that the estimate in Equation (8) will be unbiased. Indeed, while S i :¼ AMH m kmax ðA i Þ is truly a random sample of m elements from the set A kmax i and so the estimate given in Equation (8) corresponds exactly to MinHashing with k ¼ k max , truncating the elements of S i to k-length prefixes will not be a random sample of m elements from A k i due to duplicated prefixes. Consider A ¼ fAATAAGg with k max ¼ 3 and m ¼ 1: every one of the four 3-mers AAT, ATA, TAA, AAG has equal probability of being selected. As such, truncating these to 2-mers results in AA appearing with expected probability of 50% in the truncated 3-mers. In contrast, the frequency of AA in A 2 is only 25% as four distinct 2-mers A 2 ¼ fAA, AT, TA, AGg. Though the truncation step will inevitably introduce some bias in the estimation, the gain in speed overwhelms the small sacrifice to accuracy, which we empirically verify in the next sections.

Theoretical analysis of CMash
Theoretically and practically, a truncation-based estimate of the Jaccard similarity will introduce data-dependent bias. Consider an arbitrary sequence data A and let A k denote the set of all distinct kmers of length k in A. Obviously, A 1 ¼ fA; C; G; Tg. Similarly, A kþL denote the set of all distinct ðk þ LÞ-mers in A. Let ðA kþL Þ 1...k denote the distinct k-mers obtained by directly truncating all elements in A kþL from length ðk þ LÞ to k. In an ideal situation where no two elements share the same prefix of length k, the truncated kmer set ðA kþL Þ 1...k is exactly A k . Unfortunately, this will not happen in most cases where duplicate prefixes will be introduced during the truncation, leading to estimation deviance. In this section, we will show how this truncation-introduced bias correlates with the truncation length L as well as the input data themselves.

Bias in truncation-based Jaccard index
First we define a prefix relationship between two k-mers of different lengths. For k-mer M kþL of length k þ L and N k of length k, if N k is a prefix of M kþL , we can truncate M kþL by length L to get N k , which is written as N k ¼ ðM kþL Þ 1...k . We may suppress the length subscript for notational simplicity. Namely: N ¼ M 1...k .
We then define right extensions: for a given k-mer X of length k, and L 2 N, we use RE L A kþL ðXÞ to denote all ðk þ LÞ-mers in the set A kþL that have X as prefix. That is to say,

RE L
A kþL ðXÞ ¼ jfw 2 A kþL s:t: w 1;...;k ¼ Xgj Now, we can quantitatively describe the bias in the truncationbased method. Let H be a family of suitable hash functions (e.g. min-wise independent) and h min i ðAÞ be the element in the set A that minimize the hash values: h min i ðAÞ ¼ argmin a2A h i ðaÞ: Given two arbitrary non-empty genome/sequence files A and B (and the k-mer sets A kþL and B kþL for any arbitrary positive integers k, L), if we truncate all k-mers from length k þ L to k, the truncation-based Jaccard index truncated from k þ L to k, denoted by JIðA; BÞ truncðkþL!kÞ can be computed in the following: JIðA; BÞ truncðkþL!kÞ (10) Note that we must truncate larger k-mer values instead of extending smaller k-mer values as the latter would require an additional pass over the input data. However, after JIðA; BÞ k is computed in a top-down fashion from a large k value, the k can be freely set to any value smaller than the initial large input k size.
Briefly, the truncation-based estimation of Jaccard index utilized by CMash will bring a multiplicative bias factor upon the classic MinHash estimation of the Jaccard index as shown in Equation (18). This bias factor reflects the imbalance of prefixes (i.e. truncated k-mers) distributions between the intersection and the union of the original k-mer sets (before truncation). The bias factor implies that CMash will be more reliable when there are few duplication after truncation (in this case, expected number of right extension of any prefix tends to be 1) or when A and B have relative high similarity (namely, the intersection well represents the union). In other words, the truncation-based method might be limited when either k or true JI values are small. Furthermore, as a multiplicative bias is introduced during truncation, the further a truncated k is from the larger k þ L used to construct the KTST, the bias will also increase. This can be seen in Supplementary Figure S4.
To demonstrate that CMash is reliable when using large k values or running with closely related data, we applied this approach to 31 genomes within the genus Brucella with multiple k values (Fig. 2) and showed that CMash can robustly function to estimate Jaccard indices for multiple k values simultaneously. Considering the unavoidable variance due to random sampling in MinHash algorithm, the bias in CMash may not be an obstacle empirically.

Bias in truncation-based containment index
Similar to Mash Screen (Ondov et al., 2016), when computing the containment index (CI) of B k in the set A k , it is practically more convenient to form only a sketch SðA k Þ of A k and then stream the elements of B k over it, looking for matches. Hence, truncation of elements of B k is not necessary. We can then connect this streaming, truncation-based estimate of the containment index to the classic MinHash algorithm directly. To that end, let m be the given sketch size, and compute: CIðA; BÞ truncðkþL!kÞ (19) where a ¼ jSðA k Þj À jSðA kþL Þ 1...k j refers to the number of duplicate k-mers (prefixes) generated during truncating the k-mer sketches of A; and b ¼ jSðA k Þ \ B k j À jSðA kþL Þ 1...k \ B k j refers to the difference of cardinality of overlapping elements between the untruncated sketch and truncated sketch with B. Although the truncated sketch is not exact the same as the bottom sketch SðA k Þ, the differences are negligible in practice due to the uniformity of the hash function(s), as we note below in Section 3. Similar to the truncated Jaccard index, the CMash estimate of the containment index will lead to a data-dependent bias factor that relies on the original k-mer length, the truncation length as well as the k-mer distribution in the input data themselves. The bias factor can be minimized when there are few duplicate prefixes (i.e. using large k values). Besides, a larger sketch size m can overwhelm the value of a and b, making the bias negligible in practice. The performance of CMash on truncated CI is examined in Figure 3a, showing reliable estimation while being more efficient in metagenomic settings.

Results
Here, we compare the results from CMash using the truncationbased method to both the classic MinHash estimation as well as the ground truth (brute force) calculation on real and simulated data. For a proof-of-concept purpose, we coded both CMash and the classic MinHash (which has been adopted by many existing tools) via Python to perform a fair comparison. The comparison of CMash to Sourmash and Mash can be found in Supplementary Figure S3.

CMash accurately estimates the Jaccard index
Considering the metagenomic setting where researchers are less interested in distal relationships, we benchmarked the efficacy of CMash on a collection of organisms all belonging to the same genus: we selected the genus Brucella. A total of 31 complete or scaffold genomes were found in the NCBI GenBank database (Benson et al., 2018) and all were downloaded, except for a single genome belonging to the species Brucella intermedia which was discarded due to its large evolutionary distance to the remaining 30 Brucella genomes. To assess the ability of CMash to estimate the Jaccard index, we computed all pairwise Jaccard indices for this set of genomes and compared them to the ground truth Jaccard indices which were computed in a brute force fashion. Figure 2 contains the results where the k-mer size ranged from 15 to 60 in steps of 5, for a k max value of k max ¼ 60. The sketch size for CMash was m ¼ 2000 by default (estimation variance decreases exponentially with increased sketch size while the computation becomes less time-/space-efficient (Koslicki and Zabeti, 2019)). We use canonical k-mers throughout: the lexicographic minimum of the k-mer and its reverse complement.
As expected, Figure 2a shows that the Jaccard index between pairs of genomes decreases as k increases. Indeed, for a k-mer size equal to an input genome's length, the Jaccard index at this k-mer size is equivalent to exact string matching. As an aside, the rate of decrease of the Jaccard index as a function of k recapitulates the evolutionary relatedness first observed in Koslicki and Falush (2016), which motivated the investigation contained in this article.
The differences between the CMash estimate when compared with the ground truth are explained by two components: the variance introduced by the sampling-based approach of the ArgMinHash component of CMash, and the bias introduced by truncating the KTST. As seen in Figure 2b and c, neither of these biases are significant when estimating the Jaccard index with CMash when comparing pairs of genomes with medium or high Jaccard similarity. The higher relative error in Figure 2c for large k size is due to the decrease in the ground truth Jaccard index values as shown in Figure 2a. CMash and the MinHash estimate exactly agree at k ¼ k max , so the performance characteristics are already well studied in this setting (e.g. Koslicki and Zabeti, 2019). Indeed, Figure 2b shows that the absolute differences are tightly distributed around zero. Figure 2d and e depicts the performance of the classic MinHash method for comparison. The lower variance of CMash estimation is achieved through the containment MinHash method in Section 2.2 (Koslicki and Zabeti, 2019).

CMash is significantly more efficient than MinHash
The large size of microbial genome reference databases is a constraint in metagenomic analyses due to database size directly impacting computational time. This is especially a concern when multiple k values are required (Koslicki and Falush, 2016;Pierce et al., 2019;Rana et al., 2016).
To examine the ability of CMash to ameliorate these concerns with large reference databases, we analyzed simulated metagenomic reads for the containment estimation of selected reference genomes. Among all species with complete or scaffold genomes in the NCBI GenBank database (Benson et al., 2018), we randomly selected 1000 of them spanning 26 phyla, 174 families and 313 genera to serve as a reference database. Next, 200 of these 1000 genomes were used to simulate metagenomic samples. We used BBTools randomreads.sh (Bushnell, 2018) with the default metagenomic setting to simulate these datasets. In total, ten metagenomic datasets with depths ranging from 2 million reads to 20 million reads were simulated and then processed by CMash. We compared the CMash truncation-based estimate to the classic MinHash algorithm in a direct comparison: both algorithms were coded in the same programming language and using the same hash function. The choice of k values was slightly different than before: k values ranging from 20 to 60 in steps of 5 were used and k ¼ 15 was excluded because the probability of sharing a 15-mer merely by chance is not negligible in the metagenomic setting where the genome pool tends to be large.
We used the containment index (CI) in this experiment due to the very different sizes of the input data: it has been established that hashing approaches more accurately estimate the containment index in this such situations (Koslicki and Zabeti, 2019), though recall that Jaccard and containment indices can be computed from each other when the cardinalities are known (see Equation (1)). Considering that the major interests in metagenomic anlaysis are for microbes which show up in the sample (usually with moderate or high CI values) and k-mer matches from related or random genomes is unavoidable, we compared absolute difference of CI values between CMash and the classic MinHash algorithm for all the 1000 reference genomes. While the performances from different depths are similar, we only present the results for the depth of 10 million reads. The results, in Figure 3a, show that most of the absolute CI difference falls below 0.02, suggesting that CMash consistently agrees with the classic MinHash algorithm for all k values considered. In this figure, we only compare CMash and classic MinHash as the comparison to the ground truth is shown in Figure 2.
Given the comparable performance, the CMash results were obtained more efficiently in terms of space and running time. CMash requires only one reference database for the estimation of all k k max while the classic MinHash requires space linear in the number of k values (Fig. 3b). While CMash is currently a prototype model, the classic MinHash method is not implemented in the most memory efficient way (both hash values and k-mers are stored). Though the cost of the classic MinHash method was overestimated, the superiority of CMash in dealing with multiple k values is significant. In this experiment that used 10 k values, CMash used a total of 176 MB in space for the reference database while the classic MinHash approach used 947 MB to store all of its sketches. In addition, due to not needing to reconstruct new sketches for new k values, we observe in Figure 3c that the time needed for reference construction by CMash was almost one-tenth compared to the classic MinHash. The estimation portion, depicted in Figure 3d, was negligible in comparison to the database construction time, but here too we found CMash to be more efficient than the MinHash approach.

Conclusion
In this article, we introduced CMash: an algorithm and data structure that can provide efficient multi-resolution estimation of k-mer-based Jaccard and containment indices. It combines a bottom 'argmin sketch' strategy and a prefix lookup in a KTST to avoid the reconstruction of sketches for the entire reference databases each time the k-mer size is changed. One advantage of using a KTST not explored here is that a KTST can be represented as an on-disk database, thus freeing memory for other purposes. Indeed, the minimum memory needed is the size of one leaf node in the pre-built KTST. If needed, the amount of memory utilized by a KTST can be adjusted for a trade-off between speed and memory usage. We showed that this truncation-based method can provide results that well-approximate the ground truth in a more computationally efficient manner. We used CMash to analyze real microbial data and simulated metagenomic data and found it to give consistent and reliable estimates. While not an unbiased estimate for k-mer sizes smaller than the input maximum k max value, we observed that the introduced bias was negligible for genomes with moderate and high Jaccard indices.
The required space used by this approach is constant when we fix the choice of k max , regardless of the number of different k values that we are interested to explore. In contrast, a classic MinHash method requires space that is linear to the number of k values used. Similarly, the time to construct reference database is significantly improved compared to MinHash as CMash only needs to proceed the data once; hence the total running time are effectively linearly reduced with respect to the number of k-mer sizes when compared with MinHash. This feature is extremely helpful in metagenomic analysis where the reference database can be as large as hundreds gigabytes and the querying cost can be overwhelmed by reference construction cost.
Besides the algorithmic improvement on the MinHash algorithm, CMash can take advantage of other k-mer sketching methods in which truncated sketches (from prefix lookup) remain/are close to a random sample which can be then used for containment MinHash estimation (Koslicki and Zabeti, 2019). For example, spaced k-mers can be used to replace contiguous k-mers as spaced k-mers show potentials for improving metagenomic analysis (Boden et al., 2013;B rinda et al., 2015). When dealing with insertions and deletions, Strobemer (Sahlin, 2021) can be adopted if we only truncate with a length equal to the strobe length. However, care needs to be taken when a truncation sketch is not (nearly) a random sample. CMash is not capable of dealing with weighted Jaccard index and Order Min Hash (Marc¸ais et al., 2019) as the weight/ordinal information cannot be inherited during truncation, leading to erroneous estimation.
In the future, further study of the bias factor may enhance the usability of CMash. Inspired by the tight empirical distribution of the bias factor in Supplementary Figure S4, we are interested in proving some statistical error boundary for CMash estimation regarding the bias based on assumptions of similarity level. This may explain why the measured bias factor is so much better than the current theory suggests. We believe this method will be useful in many metagenomic analyses where multi-resolution estimates can illuminate evolutionary relationships. Beyond metagenomics, k-mer (or shingling)-based methods are utilized extensively in computer and data science, so the CMash approach should find application beyond computational biology by essentially allowing multi-resolution (in terms of k-mer/shingling size) queries with little sacrifice to accuracy but greatly improved efficiency.